Are you a fan of soccer? Have you ever heard of the century competition between Leo Messi and Cristiano Rolnaldo? Are you interested in numerically finding out which player is better than the others? This is the target of my project, which is predicting player ratings from their match performances.
Captain America
Since this is a supervised learning project, people might wonder, do you have a “test answer” for this kind of question. The answer is YES!! FIFA(Fédération Internationale De Football Association, aka the international soccer association, collaborated with EA gaming and produced a monumental game called FIFA. Each year, these two partners will come up with a new version of FIFA with better motion, texture, lighting, shadow detail and so on, to provide the gamers all around the world with a brand new experience in soccer. Starting from 1993 to now, FIFA and EA have developed from their first generation of FIFA International Soccer, which released in 1993, to FIFA 23, which was released by the end of 2022.
(On the left is the FIFA International Soccer, and on the right is FIFA23)
From this game, each player are rewarded with a overall rating which is calculated and watched by professional analyst from FIFA by using some mistry algorithms. Some people might doubt the credibility of this stats, because they are mainly used in the gaming. However, I see no concern of it, because the way they collected their data. Sometimes, the analysts will come to the matches and look at the match stats for every player, and also there is time when analysts will come to the training base for most of the clubs and provide some task for the player to do so that they will have a good and accurate estimate of the overall rating. This is why players who might be sitting on bench or injured for the most of the season can also get a reasonable rating after all.
Below are the examples of some top players rating and other minor attributes like shooting, passing, defending index for gamers to know better which area on the field is this particular player best of.
FIFA23 cards
Aren’t you excited about this? Let’s get started!
Fifa21 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_22/players_22.csv")
Fifa20 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_21/players_21.csv")
Fifa19 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_20/players_20.csv")
Fifa18 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_19/players_19.csv")
Fifa17 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_18/players_18.csv")
Fifa16 <- read.csv("E:/Pstat231/FIFA_Player_stats/players_17/players_17.csv")
Fifa21 <- Fifa21 %>% select(c("long_name","short_name","overall"))
Fifa20 <- Fifa20 %>% select(c("long_name","short_name","overall"))
Fifa19 <- Fifa19 %>% select(c("long_name","short_name","overall"))
Fifa18 <- Fifa18 %>% select(c("long_name","short_name","overall"))
Fifa17 <- Fifa17 %>% select(c("long_name","short_name","overall"))
Fifa16 <- Fifa16 %>% select(c("long_name","short_name","overall"))
PlayerStats21 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats21.csv")
PlayerStats20 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats20.csv")
PlayerStats19 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats19.csv")
PlayerStats18 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats18.csv")
PlayerStats17 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats17.csv")
PlayerStats16 <- read.csv("E:/Pstat231/Real_Player_stats/PlayerStats16.csv")
head(Fifa19) %>%
arrange(desc(overall))
## long_name short_name overall
## 1 Lionel Andrés Messi Cuccittini L. Messi 94
## 2 Cristiano Ronaldo dos Santos Aveiro Cristiano Ronaldo 93
## 3 Neymar da Silva Santos Júnior Neymar Jr 92
## 4 Eden Hazard E. Hazard 91
## 5 Kevin De Bruyne K. De Bruyne 91
## 6 Jan Oblak J. Oblak 91
Here are some top rated players in the FIFA21.
Before we even start building our models, there is a extremely serious problem of our data sets. Since I found my response variable and predictors from two different websites, when I was trying to combine or left join them into one common data set, the NAME column which I’m using in my left join will not work well because these two websites sometimes call the same player with different name.
For example, for the same player Ederson, who is a Brazilian Goalkeeper and plays for Brazil National Team and Manchester City, are called by his full name – Ederson Santana de Moraes – in one website and just Ederson in another. As you might see, the other Website are not just playing players by their First nor their Last name, but by their common or most familiar name for the fans.
This problem really gives me a headache, I had to figure out a way that could be used for most of our observations.
Data_Combine <- function(Fifa,PlayerStats){
nwords <- function(string, pseudo=F){
ifelse( pseudo,
pattern <- "\\S+",
pattern <- "[[:alpha:]]+"
)
str_count(string, pattern)
}
split_names_PlayerStats <- function(df) {
df$first_name <- NA
df$last_name <- word(df$Player.Name, 1)
for (i in 1:nrow(df)) {
if (nwords(df$Player.Name[i]) == 2) {
df$first_name[i] <- word(df$Player.Name[i], 1)
df$last_name[i] <- word(df$Player.Name[i], 2)
} else if (nwords(df$Player.Name[i]) == 3) {
df$first_name[i] <- word(df$Player.Name[i], 1)
df$last_name[i] <- paste(word(df$Player.Name[i], start = 2), ... = word(df$Player.Name[i], 3), sep = " ")
} else if (nwords(df$Player.Name[i]) == 1) {
df$last_name[i] <- word(df$Player.Name[i], 1)
}
}
return(df)
}
split_names_Fifa <- function(df) {
df$first_name <- NA
df$last_name <- word(df$short_name, 1)
for (i in 1:nrow(df)) {
if (nwords(df$short_name[i]) == 2) {
df$first_name[i] <- word(df$short_name[i], 1)
df$last_name[i] <- word(df$short_name[i], 2)
} else if (nwords(df$short_name[i]) == 3) {
df$first_name[i] <- word(df$short_name[i], 1)
df$last_name[i] <- paste(word(df$short_name[i], start = 2), word(df$short_name[i], 3), sep = " ")
} else if (nwords(df$short_name[i]) == 1) {
df$last_name[i] <- word(df$short_name[i], 1)
}
}
return(df)
}
get_initials <- function(strings_vector) {
initials <- character(length(strings_vector))
for (i in 1:length(strings_vector)) {
if (is.na(strings_vector[i])) {
initials[i] <- NA
} else if (nchar(strings_vector[i]) > 1) {
initials[i] <- substr(strings_vector[i], 1, 1)
} else {
initials[i] <- strings_vector[i]
}
}
return(initials)
}
PlayerStats <- split_names_PlayerStats(PlayerStats)
PlayerStats$first_name <- get_initials(PlayerStats$first_name)
Fifa <- split_names_Fifa(Fifa)
Fifa$first_name <- get_initials(Fifa$first_name)
remove_duplicates_by_name <- function(df, first_name_col = "first_name", last_name_col = "last_name") {
unique_df <- df[!duplicated(df[c(first_name_col, last_name_col)]), ]
return(unique_df)
}
PlayerStats <- remove_duplicates_by_name(PlayerStats)
Fifa <- remove_duplicates_by_name(Fifa)
join_data_frames <- function(df1, df2, first_name_col = "first_name", last_name_col = "last_name") {
merged_df <- merge(df1, df2, by.x = c(first_name_col, last_name_col), by.y = c(first_name_col, last_name_col), all.x = TRUE)
cleaned_df <- na.omit(merged_df)
return(cleaned_df)
}
Data<-join_data_frames(Fifa,PlayerStats)
return(Data)
}
Data21 <- Data_Combine(Fifa21,PlayerStats21)
Data20 <- Data_Combine(Fifa20,PlayerStats20)
Data19 <- Data_Combine(Fifa19,PlayerStats19)
Data18 <- Data_Combine(Fifa18,PlayerStats18)
Data17 <- Data_Combine(Fifa17,PlayerStats17)
Data16 <- Data_Combine(Fifa16,PlayerStats16)
Soccer <- rbind(Data16,Data17,Data18,Data19,Data20,Data21) %>%
select(-c(first_name,last_name,short_name, long_name, Player.Name,Team,GP), #
-c(Y,YR,R), #Yellow and Red cards does not really matter in ratings
-c(SOG,ATTDR,BCM,BCS,TOFF),
-c(SA,AOP,ACR,BCC,ASP,CCSP,CCOP,ACRO,POSL),
-c(TKLW,LMT,LPOPP,PENT,CLROL,EG,ES,OWN))%>%
filter(.,POS != "G")
Soccer %>%
head()
## overall League POS MIN G A S INT CR CC BLK TKL P AW BR DR DW TOUCH TBOX
## 3 80 LIGA D 878 0 0 2 32 1 2 5 18 284 17 42 2 41 497 5
## 5 72 BUND M 1321 1 0 9 34 1 7 7 42 489 32 61 3 100 859 9
## 10 74 SERI M 1071 2 0 16 28 10 7 1 18 483 16 99 17 68 779 14
## 12 73 SERI D 867 0 0 8 15 54 9 3 18 226 14 48 10 52 600 7
## 13 75 EPL M 121 0 0 4 1 9 1 0 5 24 0 6 2 9 69 7
## 15 69 LIGA F 230 0 0 1 2 4 5 1 11 68 3 19 2 25 141 4
## PFT DL AFZP BLKCR PK ALB ATB
## 3 24 30 70 3 0 14 0
## 5 97 93 233 4 0 28 0
## 10 127 76 281 1 0 32 1
## 12 61 48 127 10 0 15 0
## 13 6 18 14 0 0 1 0
## 15 25 22 44 2 0 2 0
Oh my gosh that’s a huge chunk of code right
there and finally my job of tidying and combining data is done. I have
narrowed down the data files for all players rating from the Top 7
leagues from 2016-2021 to just one, which I have named “Soccer” in our
terminal. As a soccer fan for more than 15 years, I have excluded
several predictors that seem irrelevant to a player’s overall rating,
such as their names, the number of games they have played, the
Red/Yellow cards they have received, and own goals. Additionally, I have
excluded some predictors that are highly correlated with existing ones.
For example, while I included the predictor “SHOOT,” which indicates the
number of shots completed, I excluded “SOG,” which indicates shots on
goal, since these two predictors are highly related and derived from
each other. Moving forward with model building, I have chosen to exclude
them from my Soccer file. Furthermore, I have decided to exclude all
goalkeepers from our analysis since most of the predictors are basically
irrelevant for them.
dim(Soccer)
## [1] 13863 26
vis_miss(Soccer)
summary(Soccer)
## overall League POS MIN
## Min. :47.0 Length:13863 Length:13863 Min. : 0
## 1st Qu.:67.0 Class :character Class :character 1st Qu.: 383
## Median :72.0 Mode :character Mode :character Median :1194
## Mean :71.7 Mean :1276
## 3rd Qu.:76.0 3rd Qu.:2061
## Max. :94.0 Max. :3787
## G A S INT
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 2.0
## Median : 1.000 Median : 0.000 Median : 10.00 Median : 11.0
## Mean : 1.905 Mean : 1.335 Mean : 17.74 Mean : 16.7
## 3rd Qu.: 2.000 3rd Qu.: 2.000 3rd Qu.: 24.00 3rd Qu.: 26.0
## Max. :41.000 Max. :21.000 Max. :208.00 Max. :112.0
## CR CC BLK TKL
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 2.00 1st Qu.: 0.000 1st Qu.: 6.00
## Median : 8.00 Median : 7.00 Median : 2.000 Median : 18.00
## Mean : 25.64 Mean : 13.19 Mean : 4.419 Mean : 23.55
## 3rd Qu.: 33.00 3rd Qu.: 19.00 3rd Qu.: 6.000 3rd Qu.: 36.00
## Max. :382.00 Max. :136.00 Max. :61.000 Max. :146.00
## P AW BR DR
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 104.0 1st Qu.: 4.0 1st Qu.: 18.00 1st Qu.: 2.00
## Median : 358.0 Median : 13.0 Median : 60.00 Median : 7.00
## Mean : 475.7 Mean : 22.5 Mean : 72.35 Mean : 12.79
## 3rd Qu.: 724.5 3rd Qu.: 31.0 3rd Qu.:114.00 3rd Qu.: 18.00
## Max. :2903.0 Max. :322.0 Max. :393.00 Max. :185.00
## DW TOUCH TBOX PFT
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 21.00 1st Qu.: 224.0 1st Qu.: 4.00 1st Qu.: 24.0
## Median : 66.00 Median : 721.0 Median : 16.00 Median : 80.0
## Mean : 74.88 Mean : 844.7 Mean : 29.58 Mean : 113.8
## 3rd Qu.:116.00 3rd Qu.:1326.0 3rd Qu.: 39.00 3rd Qu.: 167.0
## Max. :403.00 Max. :3908.0 Max. :337.00 Max. :1069.0
## DL AFZP BLKCR PK
## Min. : 0.00 Min. : 0 Min. : 0.00 Min. : 0.0000
## 1st Qu.: 23.00 1st Qu.: 57 1st Qu.: 0.00 1st Qu.: 0.0000
## Median : 65.00 Median : 188 Median : 1.00 Median : 0.0000
## Mean : 74.99 Mean : 246 Mean : 2.53 Mean : 0.2357
## 3rd Qu.:112.00 3rd Qu.: 366 3rd Qu.: 3.00 3rd Qu.: 0.0000
## Max. :441.00 Max. :1707 Max. :42.00 Max. :15.0000
## ALB ATB
## Min. : 0.00 Min. : 0.0000
## 1st Qu.: 4.00 1st Qu.: 0.0000
## Median : 18.00 Median : 0.0000
## Mean : 32.05 Mean : 0.6297
## 3rd Qu.: 47.00 3rd Qu.: 1.0000
## Max. :301.00 Max. :31.0000
Soccer$overall <- as.numeric(Soccer$overall)
Soccer$POS <- factor(Soccer$POS)
Soccer$League <- factor(Soccer$League)
Good thing is there is no missing data at all in our data. And This is a fair amount of observations and I’m quite confident about the process given the fact that the more observation, the more accuracy we can obtain.
Soccer %>%
select(where(is.numeric)) %>%
cor() %>%
corrplot(type="upper", tl.cex=0.75)
As we can see, there are a lot of positive correlations including some
really strong ones. For example, Min played has some serious
correlations across all predictors. This makes sense because the more
mins any play spends on the field, his amount of touches, passes and
even goals will significantly increase.
ggplot(data=Soccer,mapping = aes(x=forcats::fct_infreq(League))) +
geom_bar(fill = "Orange", color = "black") +
labs(title="Numbers of Players in Each League",x = "League") +
theme(panel.background = element_rect(fill = "#f0f0f0"),
panel.grid.minor = element_blank(),
axis.title = element_text(size = 12))
ggplot(data=Soccer,mapping = aes(x=forcats::fct_infreq(POS))) +
geom_bar(fill = "Dark Green", color = "black") +
labs(title = "Number of Players in Each Position",x = "Position") +
theme(panel.background = element_rect(fill = "#f0f0f0"),
panel.grid.minor = element_blank(),
axis.title = element_text(size = 12))
ggplot(data=Soccer, aes(x="", y= MIN)) +
geom_boxplot(fill="orange", color="black") +
labs(title="Mins played Distribution", y="Minutes Played") +
theme_minimal()
summary(Soccer$MIN)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 383 1194 1276 2061 3787
Soccer <- Soccer[Soccer$MIN >= 382,]
This box plot tells me a surprising fact that majority of soccer players even they are playing in top leagues, they would spend a significant amount of time sitting on bench. This plot is highly skewed to the right with with 25 percentile of the players play less than 500 mins for the entire 2021 season, which is approximately 5 games given total about 60 games possible for any professional soccer team in a regular season. That’s sad isn’t it :(
Therefore, I have to exclude the lower 25% of the observations by mins played because their match performance are most likely not reflective to their real abilities.
ggplot(data=Soccer, aes(x=TOUCH, y=P, fill=POS, group = POS)) +
geom_boxplot() +
labs(title="Relationship Between Touches and Passes by Position", x="Touches", y="Passes") +
theme_minimal()
ggplot(Soccer, aes(P)) +
geom_bar() +
labs(title="Numbers of Passes Distribution") +
xlim(0, 2500) +
ylim(0, 30) +
scale_fill_manual()
## Warning: Removed 13 rows containing non-finite values (`stat_count()`).
ggplot(Soccer, aes(x=P, y=TOUCH)) +
geom_jitter(width = 0.5, size = 1) +
geom_smooth(method = "lm", se =F, col="Green") +
labs(title = "Touches vs. Passes")
## `geom_smooth()` using formula = 'y ~ x'
ggplot(Soccer, aes(x=MIN, y=TOUCH)) +
geom_jitter(width = 0.5, size = 1) +
geom_smooth(method = "lm", se =F, col="Green") +
labs(title = "Minutes Played vs. Touches")
## `geom_smooth()` using formula = 'y ~ x'
As we can see here, there is a strong positive linear relationship between touches and passes, touches and mins played. These are important notes we need to consider when we set up the recipe, otherwise it will ruin the effectiveness of our models.
ggplot(data = Soccer, aes(x = POS, y = S, fill = TOUCH)) +
geom_violin(width = 0.7, color = "Darkgreen") +
stat_summary(fun = median, geom = "point", shape = 21, size = 3, color = "white") +
scale_y_continuous(limits = c(0, 100)) +
labs(title = "Distribution of Touches inside the Box vs. Shoot by Position", x = "Position", y = "Number of Shoot") +
theme_minimal()
## Warning: Removed 125 rows containing non-finite values (`stat_ydensity()`).
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in
## the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical
## variable into a factor?
## Warning: Removed 125 rows containing non-finite values (`stat_summary()`).
There is also a strong correlation between touches inside the box and shoot, and, furthermore, Forward and Forward/Midfield players have a higher number of touches inside the box and shoot comparing with players of other positions.
In conclusion, it is the fact that we have a lot of correlated variables in our predictors and all of the relationships make sense. In our recipe set up, I have to take into account of these correlations in it, otherwise this could seriously affect our results.
set.seed(6656)
Soccer_split <- initial_split(Soccer,strata = overall,prop = 3/4)
Soccer_test <- testing(Soccer_split)
Soccer_train <- training(Soccer_split)
Soccer_recipe <- recipe(overall ~ . , data=Soccer_train)%>%
step_dummy(all_nominal_predictors()) %>%
step_center(all_nominal_predictors()) %>%
step_scale(all_nominal_predictors()) %>%
step_interact(terms = ~ P:TOUCH) %>%
step_interact(terms = ~ TBOX:S) %>%
step_interact(terms = ~ MIN:TOUCH)
After initial spliting, setting recipe is probably one of the most important step of my project overall. Making sure factors,like League and Position, are dummy coded. Capturing any correlations between predictors in our recipe set up can improve the accuracy of our model and avoid collinearity. Based in our correlation plot, we can see there are some variables such as S and TBOX, TOUCH and P. Ignoring such correlation could cause the model to be inaccurate. Lastly, I centered and scaled all the nominal predictors.
Soccer_folds <- vfold_cv(Soccer_train, v = 5, strata = overall)
Using K-fold cross validation is important because the performance of a model on a single dataset can be highly dependent on the particular data points included in that dataset, and may not generalize well to other datasets. By using k-fold cross-validation, we can get a more reliable estimate of a model’s performance, since it is evaluated on multiple partitions of the data, rather than just one.
K-fold cross-validation can also help to identify issues such as overfitting, where the model performs well on the training data but poorly on new, unseen data. By evaluating the model on different folds of the data, we can get a better sense of how well the model is likely to generalize to new data.
Overall, k-fold cross-validation is a useful technique for evaluating and comparing machine learning models, and is often used in research and in practice to help ensure that models are robust and reliable. We saved our folded data into Soccer_folds.
Now, after setting up a satisfactory recipe and cross validation, it is time to build our models. In the following steps, I will conduct 8 different models, including Linear Regression, K-nearest Neighbors, Elastic Net, Ridge Regression, Lasso Regression, Boosted Trees, and Random Forest. In addition, Root Mean Square Error – RMSE – will become my metric when it comes to measure the accuracy of my models. In another words, when training our models in Soccer_folds, the one has the lowest RMSE will become our best model. In the end I will preceed with two best models in our final fitting just to see how result could change from using the best or the second best model.
Furthermore, because many of the models need some sort of tuning, meaning that we are testing different levels and range of those hyperparameters and see which one gives us the best results, using autoplot function in R enables us to observe the impact of every tuned parameter on each model’s performance. The autoplots illustrate the model’s performance through its RMSE score, where lower scores indicate better performance.
Let’s see what happens.
lm_reg <- linear_reg() %>%
set_engine("lm")
lm_workflow <- workflow() %>%
add_model(lm_reg) %>%
add_recipe(Soccer_recipe)
lm_fit <- lm_workflow %>%
fit_resamples(resamples = Soccer_folds)
best_lm <- collect_metrics(lm_fit)%>%
slice(1);best_lm
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 rmse standard 4.23 5 0.0698 Preprocessor1_Model1
knn_mod <- nearest_neighbor(neighbors = tune()) %>%
set_mode("regression") %>%
set_engine("kknn")
knn_workflow <- workflow() %>%
add_model(knn_mod) %>%
add_recipe(Soccer_recipe)
neighbors_grid_knn <- grid_regular(neighbors(range = c(1, 10)), levels = 10)
tune_res_knn <- tune_grid(
object = knn_workflow,
resamples = Soccer_folds,
grid = neighbors_grid_knn,
control = control_grid(verbose = TRUE)
)
## i Fold1: preprocessor 1/1
## ✓ Fold1: preprocessor 1/1
## i Fold1: preprocessor 1/1, model 1/1
## ✓ Fold1: preprocessor 1/1, model 1/1
## i Fold1: preprocessor 1/1, model 1/1 (predictions)
## i Fold2: preprocessor 1/1
## ✓ Fold2: preprocessor 1/1
## i Fold2: preprocessor 1/1, model 1/1
## ✓ Fold2: preprocessor 1/1, model 1/1
## i Fold2: preprocessor 1/1, model 1/1 (predictions)
## i Fold3: preprocessor 1/1
## ✓ Fold3: preprocessor 1/1
## i Fold3: preprocessor 1/1, model 1/1
## ✓ Fold3: preprocessor 1/1, model 1/1
## i Fold3: preprocessor 1/1, model 1/1 (predictions)
## i Fold4: preprocessor 1/1
## ✓ Fold4: preprocessor 1/1
## i Fold4: preprocessor 1/1, model 1/1
## ✓ Fold4: preprocessor 1/1, model 1/1
## i Fold4: preprocessor 1/1, model 1/1 (predictions)
## i Fold5: preprocessor 1/1
## ✓ Fold5: preprocessor 1/1
## i Fold5: preprocessor 1/1, model 1/1
## ✓ Fold5: preprocessor 1/1, model 1/1
## i Fold5: preprocessor 1/1, model 1/1 (predictions)
collect_metrics(tune_res_knn)
## # A tibble: 20 × 7
## neighbors .metric .estimator mean n std_err .config
## <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 1 rmse standard 5.88 5 0.0800 Preprocessor1_Model01
## 2 1 rsq standard 0.263 5 0.0150 Preprocessor1_Model01
## 3 2 rmse standard 5.45 5 0.0785 Preprocessor1_Model02
## 4 2 rsq standard 0.301 5 0.0150 Preprocessor1_Model02
## 5 3 rmse standard 5.14 5 0.0793 Preprocessor1_Model03
## 6 3 rsq standard 0.335 5 0.0153 Preprocessor1_Model03
## 7 4 rmse standard 4.94 5 0.0812 Preprocessor1_Model04
## 8 4 rsq standard 0.360 5 0.0158 Preprocessor1_Model04
## 9 5 rmse standard 4.82 5 0.0823 Preprocessor1_Model05
## 10 5 rsq standard 0.377 5 0.0160 Preprocessor1_Model05
## 11 6 rmse standard 4.73 5 0.0820 Preprocessor1_Model06
## 12 6 rsq standard 0.390 5 0.0161 Preprocessor1_Model06
## 13 7 rmse standard 4.67 5 0.0808 Preprocessor1_Model07
## 14 7 rsq standard 0.400 5 0.0159 Preprocessor1_Model07
## 15 8 rmse standard 4.63 5 0.0793 Preprocessor1_Model08
## 16 8 rsq standard 0.407 5 0.0156 Preprocessor1_Model08
## 17 9 rmse standard 4.59 5 0.0777 Preprocessor1_Model09
## 18 9 rsq standard 0.413 5 0.0152 Preprocessor1_Model09
## 19 10 rmse standard 4.57 5 0.0767 Preprocessor1_Model10
## 20 10 rsq standard 0.418 5 0.0150 Preprocessor1_Model10
autoplot(tune_res_knn)
best_neighbors_knn <- select_by_one_std_err(tune_res_knn, desc(neighbors), metric = "rmse");best_neighbors_knn
## # A tibble: 1 × 9
## neighbors .metric .estimator mean n std_err .config .best .bound
## <int> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 10 rmse standard 4.57 5 0.0767 Preprocessor1_M… 4.57 4.64
en_mod <- linear_reg(mixture = tune(),
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
en_workflow <- workflow() %>%
add_recipe(Soccer_recipe) %>%
add_model(en_mod)
en_grid <- grid_regular(penalty(range = c(0, 1),
trans = identity_trans()),
mixture(range = c(0, 1)),
levels = 10)
tune_res_en <- tune_grid(
en_workflow,
resamples = Soccer_folds,
grid = en_grid
)
collect_metrics(tune_res_en)
## # A tibble: 200 × 8
## penalty mixture .metric .estimator mean n std_err .config
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0 0 rmse standard 4.26 5 0.0712 Preprocessor1_Model001
## 2 0 0 rsq standard 0.488 5 0.0138 Preprocessor1_Model001
## 3 0.111 0 rmse standard 4.26 5 0.0712 Preprocessor1_Model002
## 4 0.111 0 rsq standard 0.488 5 0.0138 Preprocessor1_Model002
## 5 0.222 0 rmse standard 4.26 5 0.0712 Preprocessor1_Model003
## 6 0.222 0 rsq standard 0.488 5 0.0138 Preprocessor1_Model003
## 7 0.333 0 rmse standard 4.27 5 0.0710 Preprocessor1_Model004
## 8 0.333 0 rsq standard 0.486 5 0.0137 Preprocessor1_Model004
## 9 0.444 0 rmse standard 4.28 5 0.0709 Preprocessor1_Model005
## 10 0.444 0 rsq standard 0.483 5 0.0137 Preprocessor1_Model005
## # … with 190 more rows
autoplot(tune_res_en)
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
best_en <- select_by_one_std_err(tune_res_en,
metric = "rmse", penalty, mixture);best_en
## # A tibble: 1 × 10
## penalty mixture .metric .estimator mean n std_err .config .best .bound
## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 0 0 rmse standard 4.26 5 0.0712 Preproces… 4.23 4.30
ridge_reg <- linear_reg(mixture = 0,
penalty = tune()) %>%
set_mode("regression") %>%
set_engine("glmnet")
ridge_workflow <- workflow() %>%
add_recipe(Soccer_recipe) %>%
add_model(ridge_reg)
penalty_grid <- grid_regular(penalty(range = c(-5,5)), levels = 20)
tune_res_ridge <- tune_grid(
ridge_workflow,
resamples = Soccer_folds,
grid = penalty_grid
)
## ! Fold1: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
collect_metrics(tune_res_ridge)
## # A tibble: 40 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00001 rmse standard 4.26 5 0.0712 Preprocessor1_Model01
## 2 0.00001 rsq standard 0.488 5 0.0138 Preprocessor1_Model01
## 3 0.0000336 rmse standard 4.26 5 0.0712 Preprocessor1_Model02
## 4 0.0000336 rsq standard 0.488 5 0.0138 Preprocessor1_Model02
## 5 0.000113 rmse standard 4.26 5 0.0712 Preprocessor1_Model03
## 6 0.000113 rsq standard 0.488 5 0.0138 Preprocessor1_Model03
## 7 0.000379 rmse standard 4.26 5 0.0712 Preprocessor1_Model04
## 8 0.000379 rsq standard 0.488 5 0.0138 Preprocessor1_Model04
## 9 0.00127 rmse standard 4.26 5 0.0712 Preprocessor1_Model05
## 10 0.00127 rsq standard 0.488 5 0.0138 Preprocessor1_Model05
## # … with 30 more rows
autoplot(tune_res_ridge)
best_ridge <- select_by_one_std_err(tune_res_ridge,
metric = "rmse", penalty);best_ridge
## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config .best .bound
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 0.00001 rmse standard 4.26 5 0.0712 Preprocessor1_Mod… 4.26 4.33
lasso_reg <- linear_reg(penalty = tune(),
mixture = 1) %>%
set_mode("regression") %>%
set_engine("glmnet")
lasso_workflow <- workflow() %>%
add_recipe(Soccer_recipe) %>%
add_model(ridge_reg)
penalty_grid <- grid_regular(penalty(range = c(-5,5)), levels = 20)
tune_res_lasso <- tune_grid(
lasso_workflow,
resamples = Soccer_folds,
grid = penalty_grid
)
## ! Fold1: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
## There were 4 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 17: `penalty = 2636.651`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
collect_metrics(tune_res_lasso)
## # A tibble: 40 × 7
## penalty .metric .estimator mean n std_err .config
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 0.00001 rmse standard 4.26 5 0.0712 Preprocessor1_Model01
## 2 0.00001 rsq standard 0.488 5 0.0138 Preprocessor1_Model01
## 3 0.0000336 rmse standard 4.26 5 0.0712 Preprocessor1_Model02
## 4 0.0000336 rsq standard 0.488 5 0.0138 Preprocessor1_Model02
## 5 0.000113 rmse standard 4.26 5 0.0712 Preprocessor1_Model03
## 6 0.000113 rsq standard 0.488 5 0.0138 Preprocessor1_Model03
## 7 0.000379 rmse standard 4.26 5 0.0712 Preprocessor1_Model04
## 8 0.000379 rsq standard 0.488 5 0.0138 Preprocessor1_Model04
## 9 0.00127 rmse standard 4.26 5 0.0712 Preprocessor1_Model05
## 10 0.00127 rsq standard 0.488 5 0.0138 Preprocessor1_Model05
## # … with 30 more rows
autoplot(tune_res_lasso)
best_lasso <- select_by_one_std_err(tune_res_lasso,
metric = "rmse", penalty);best_lasso
## # A tibble: 1 × 9
## penalty .metric .estimator mean n std_err .config .best .bound
## <dbl> <chr> <chr> <dbl> <int> <dbl> <chr> <dbl> <dbl>
## 1 0.00001 rmse standard 4.26 5 0.0712 Preprocessor1_Mod… 4.26 4.33
rf_spec <- rand_forest(mtry = tune(),
trees = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("regression")
rf_workflow <- workflow() %>%
add_recipe(Soccer_recipe) %>%
add_model(rf_spec)
rf_parameter_grid <- grid_regular(mtry(range = c(4, 8)), trees(range = c(200,800)), min_n(range = c(30,50)), levels = 8)
rf_tune_res <- tune_grid(
rf_workflow,
resamples = Soccer_folds,
grid = rf_parameter_grid
)
collect_metrics(rf_tune_res)
## # A tibble: 640 × 9
## mtry trees min_n .metric .estimator mean n std_err .config
## <int> <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 4 200 30 rmse standard 4.35 5 0.0690 Preprocessor1_Model…
## 2 4 200 30 rsq standard 0.477 5 0.0140 Preprocessor1_Model…
## 3 5 200 30 rmse standard 4.32 5 0.0683 Preprocessor1_Model…
## 4 5 200 30 rsq standard 0.483 5 0.0135 Preprocessor1_Model…
## 5 6 200 30 rmse standard 4.30 5 0.0643 Preprocessor1_Model…
## 6 6 200 30 rsq standard 0.485 5 0.0125 Preprocessor1_Model…
## 7 7 200 30 rmse standard 4.28 5 0.0673 Preprocessor1_Model…
## 8 7 200 30 rsq standard 0.487 5 0.0131 Preprocessor1_Model…
## 9 8 200 30 rmse standard 4.27 5 0.0691 Preprocessor1_Model…
## 10 8 200 30 rsq standard 0.487 5 0.0133 Preprocessor1_Model…
## # … with 630 more rows
autoplot(rf_tune_res)
best_rf <- select_by_one_std_err(rf_tune_res, metric = "rmse",mtry,trees,min_n)
bt_reg <- boost_tree(mtry = tune(),
trees = tune(),
learn_rate = tune()) %>%
set_engine("xgboost") %>%
set_mode("regression")
bt_reg_workflow <- workflow() %>%
add_model(bt_reg) %>%
add_recipe(Soccer_recipe)
bt_grid <- grid_regular(mtry(range = c(1, 6)),
trees(range = c(200, 600)),
learn_rate(range = c(-10, -1)),
levels = 5)
bt_tune_res <- tune_grid(
bt_reg_workflow,
resamples = Soccer_folds,
grid = bt_grid
)
## ! Fold1: internal:
## There were 5 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold2: internal:
## There were 5 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold3: internal:
## There were 5 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold4: internal:
## There were 5 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## ! Fold5: internal:
## There were 5 warnings in `dplyr::summarise()`.
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 1`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 2`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 3`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 4`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
## The first warning was:
## ℹ In argument: `.estimate = metric_fn(truth = overall, estimate = .pre...
## = na_rm)`.
## ℹ In group 1: `mtry = 6`, `trees = 200`, `learn_rate = 1e-10`.
## Caused by warning:
## ! A correlation computation is required, but `estimate` is constant an...
## ℹ Run ]8;;ide:run:dplyr::last_dplyr_warnings()dplyr::last_dplyr_warn...
autoplot(bt_tune_res)
show_best(bt_tune_res, n = 1)
## Warning: No value of `metric` was given; metric 'rmse' will be used.
## # A tibble: 1 × 9
## mtry trees learn_rate .metric .estimator mean n std_err .config
## <int> <int> <dbl> <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 3 200 0.1 rmse standard 4.25 5 0.0568 Preprocessor1_M…
best_bt <- select_by_one_std_err(bt_tune_res, metric = "rmse",mtry,trees,learn_rate)
It is time to check how our models do comparing with each other!
Overall_results <- bind_rows(best_lm, best_neighbors_knn, best_en, best_ridge, best_lasso, best_rf, best_bt) %>%
tibble() %>%
mutate(model = c("Linear Regression", "K-nearest Neighbors", "Elastic net", "Ridge Regression", "Lasso Regression", "Random Forest", "Boosted Trees")) %>%
arrange(mean)%>%
select(model,mean) %>%
mutate(mean = round(mean, 5))
Overall_results
## # A tibble: 7 × 2
## model mean
## <chr> <dbl>
## 1 Linear Regression 4.23
## 2 Elastic net 4.26
## 3 Ridge Regression 4.26
## 4 Lasso Regression 4.26
## 5 Boosted Trees 4.27
## 6 Random Forest 4.32
## 7 K-nearest Neighbors 4.57
It looks like the Linear Regression gives us the best
rmse and become our most ideal model to use on the testing data set.
This also makes sense because I have captured many potential
collinearity in our predictors during the step of recipe set up, as we
having a sense that there are a fair amount of correlation between our
predictor in our correlation plot, Linear Regression finally give us the
best results. Now, we proceed to the final fitting step. Allmost
There!
lm_final_workflow_train <- finalize_workflow(lm_workflow, best_lm)
rf_final_fit_train <- fit(lm_final_workflow_train, data = Soccer_train)
Soccer_tibble <- predict(rf_final_fit_train, new_data = Soccer_test %>% select(-overall))
Soccer_tibble <- bind_cols(Soccer_tibble, Soccer_test %>% select(overall))
rmse(Soccer_tibble, truth = overall, .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 4.21
Based on our calculated rmse, it is out of my expectation since it has done a even better job compare to the training data. In addition, since the observations are ranged from 0 to 100, the Linear Regression model hasn’t done a bad job.
Soccer_tibble %>%
ggplot(aes(x = .pred, y = overall)) +
geom_point(alpha = 0.4) +
geom_abline(lty = 2) +
theme_grey() +
coord_obs_pred() +
labs(title = "Predicted Values vs. Actual Values")
Woah! That’s some very expressive result right there.
If most of our points are on the diagnal axis on this predicted values
vs. actual values graph, that means we have reached a pretty decent
precision in predicting the actual values.
In conclusion, My Machine Learning project has been a resounding success in achieving a high level of precision using the Linear Regression model. I successfully designed, developed, and deployed a robust and accurate prediction model that could handle the complexities of our dataset.
One of the key factors in the success was the careful preprocessing of my data sets. I spent a significant amount of time cleaning, formatting and combining the datasets to ensure that it was sufficient and suitable for analysis.
After preprocessing, I experimented with different models and techniques to determine the best approach for my dataset. The Linear Regression model was the clear winner, outperforming other popular models such as random forests and support vector machines. I then fine-tuned the parameters of the model to achieve the lowest possible Root Mean Square Error (RMSE), which is a standard metric used to evaluate the accuracy of regression models.
Looking ahead, I believe that further research and experimentation
with different models and techniques can improve the performance of my
system and enhance its applicability. In summary, I am incredibly proud
and thankful of the work I have accomplished on this project, especially
the help from the professor Doctor Coburn. I am excited about the
possibilities that Machine Learning can offer in the future, and I’m
looking forward to continuing exploring in this exciting field.